Abstract

Background: We have compared 38 isolates of the SARS-CoV complete genome. The main goal 
was twofold: first, to analyze and compare nucleotide sequences and to identify positions of single 
nucleotide polymorphism (SNP), insertions and deletions, and second, to group them according to 
sequence similarity, eventually pointing to phylogeny of SARS-CoV isolates. The comparison is 
based on genome polymorphism such as insertions or deletions and the number and positions of 
SNPs. 

Results: The nucleotide structure of all 38 isolates is presented. Based on insertions and deletions 
and dissimilarity due to SNPs, the dataset of all the isolates has been qualitatively classified into 
three groups each having their own subgroups. These are the A-group with "regular" isolates (no 
insertions / deletions except for 5' and 3' ends), the B-group of isolates with "long insertions", and 
the C-group of isolates with "many individual" insertions and deletions. The isolate with the 
smallest average number of SNPs, compared to other isolates, has been identified (TWH). The 
density distribution of SNPs, insertions and deletions for each group or subgroup, as well as 
cumulatively for all the isolates is also presented, along with the gene map for TWH. 

Since individual SNPs may have occurred at random, positions corresponding to multiple SNPs 
(occurring in two or more isolates) are identified and presented. This result revises some previous 
results of a similar type. Amino acid changes caused by multiple SNPs are also identified (for the 
annotated sequences, as well as presupposed amino acid changes for non-annotated ones). Exact 
SNP positions for the isolates in each group or subgroup are presented. Finally, a phylogenetic tree 
for the SARS-CoV isolates has been produced using the CLUSTALW program, showing high 
compatibility with former qualitative classification. 

Conclusions: The comparative study of SARS-CoV isolates provides essential information for 
genome polymorphism, indication of strain differences and variants evolution. It may help with the 
development of effective treatment. 
Background

Severe Acute Respiratory Syndrome (SARS) is a new infectious
disease first reported in the autumn of 2002 and
diagnosed for the first time in March 2003 [1]. It is still a
serious threat to human health, and SARS coronavirus
(CoV) has been associated with the pathogenesis of SARS
according to Koch's postulates [2].

Significant research efforts have been made to investigate
the SARS-CoV genome sequence, with the aim of establishing
its origin and evolution and, eventually, of helping to
prevent or cure the disease it causes. Although the
task is a hard one, it opens up the opportunity, among
others, for comparative investigation of different SARS-CoV
isolates aimed at identifying the properties of genome
regions that express different levels of sequence
polymorphism [3-8].

The genome of SARS-CoV consists of a single positive
RNA strand approximately 30 kb in length, consisting of
about 10 open reading frames (ORFs) and about 10 intergenic
regions (IGRs). The first two overlapping ORFs at
the 5' end encompass two-thirds of the genome, while the
rest of the ORFs at the 3' end account for the remaining
third.


Table 1: List of the SARS-CoV complete genome isolates investigated. Included are isolates' labels, IDs, accession numbers, length in
nucleotides, dates of revisions considered and countries and sources of isolates.

Label  ID           Accession No.  Length  Revision date  Country/Source
1.     TWH          AP006557.1     29727   02-AUG-2003    Taiwan: patient #01
       TWC2         AY362698.1             13-AUG-2003    Taiwan: Hoping Hospital
2.     TWC3         AY362699.1     29727   13-AUG-2003    Taiwan: Hoping Hospital
3.     TWK          AP006559.1     29727   02-AUG-2003    Taiwan: patient #06
4.     TWS          AP006560.1     29727   02-AUG-2003    Taiwan: patient #04
5.     TWY          AP006561.1     29727   02-AUG-2003    Taiwan: patient #02
6.     Urbani       AY278741.1     29727   12-AUG-2003    USA: Atlanta
7.     TWJ          AP006558.1     29725   02-AUG-2003    Taiwan: patient #043
8.     TWC          AY321118.1     29725   26-JUN-2003    Taiwan, first fatal case
9.     WHU          AY394850.2     29728   12-JAN-2004    China: Wuhan
10.    TW1          AY291451.1     29729   14-MAY-2003    Taiwan
11.    Frankfurt 1  AY291315.1     29727   11-JUN-2003    Germany: Frankfurt
12.    FRA          AY310120.1     29740   12-DEC-2003    Germany: patient from Frankfurt
13.    HKU-39849    AY278491.2     29742   29-AUG-2003    China: Hong Kong
14.    Tor2         AY274119.3     29751   16-MAY-2003    Canada: Toronto, patient #2
                    NC_004718.3            06-FEB-2004    Canada: Toronto, patient #2
15.    HSR1         AY323977.2     29751   15-OCT-2003    Italy
16.    CUHK-Su10    AY282752.2     29736   17-NOV-2003    China: Hong Kong
17.    CUHK-W1      AY278554.2     29736   31-JUL-2003    China: Hong Kong
18.    GZ50         AY304495.1     29720   05-NOV-2003    China: Hong Kong
19.    AS           AY427439.1     29711   21-OCT-2003    Italy: Milan
20.    Sin2500      AY283794.1     29711   12-AUG-2003    Singapore
21.    Sin2679      AY283796.1     29711   12-AUG-2003    Singapore
22.    Sin2774      AY283798.2     29711   02-OCT-2003    Singapore
23.    Sin2677      AY283795.1     29705   12-AUG-2003    Singapore
24.    Sin2748      AY283797.1     29706   12-AUG-2003    Singapore
25.    BJ01         AY278488.2     29725   01-MAY-2003    China: Beijing
26.    BJ02         AY278487.3     29745   05-JUN-2003    China: Beijing
27.    BJ03         AY278490.3     29740   05-JUN-2003    China: Beijing
28.    BJ04         AY279354.2     29732   05-JUN-2003    China: Beijing
29.    Taiwan TC1   AY338174.1     29573   28-JUL-2003    Taiwan
30.    Taiwan TC2   AY338175.1     29573   28-JUL-2003    Taiwan
31.    Taiwan TC3   AY348314.1     29573   29-JUL-2003    Taiwan
32.    GD01         AY278489.2     29757   18-AUG-2003    China: Beijing
33.    SZ3          AY304486.1     29741   05-NOV-2003    China: Hong Kong
34.    SZ16         AY304488.1     29731   05-NOV-2003    China: Hong Kong
35.    ZJ01         AY297028.1     29715   19-MAY-2003    China: Beijing
36.    ZMY 1        AY351680.1     29749   03-AUG-2003    China: Guangdong





BMC Bioinformatics 2004, 5 


http://www.biomedcentral.com/1471-2105/5/65


[Figure 1 graphic: alignment of the 5' and 3' ends and of internal insertions and deletions of isolates 1-36, on the TWH position scale.]

> = cct actggttacc aacctgaatg gaatat

Same structure genomes: TWC3, TWK, TWS, TWY, Urbani and Frankfurt 1 as TWH; HSR1 as Tor2; CUHK-W1 as CUHK-Su10; Sin2500, Sin2679 and
Sin2774 as AS; Taiwan TC2 and Taiwan TC3 as Taiwan TC1.


Figure 1

Comparison of nucleotide structures of SARS-CoV complete genome isolates. Insertions are denoted as emphasized
(italic) and >, deletions by a minus sign. Positions are given in relation to the TWH isolate. The two isolates with a
large number of individual insertions (ZJ01, ZMY 1) are given separately, with exact positions of insertions and deletions.


We investigated 38 isolates of the SARS-CoV complete
genome (two pairs of which were identical), sequenced
and published by October 31st, 2003 (with updated revisions
up to February 20th, 2004). Sequences were taken
from the PubMed NCBI Entrez site [9] in gbk and fasta
formats (Table 1). The main goal was twofold: first, to
analyze and compare nucleotide sequences and to identify
SNP positions, insertions and deletions, and second, to
group the isolates according to sequence similarity, eventually
pointing to the phylogeny of SARS-CoV isolates.

According to the length of isolates (insertions and deletions)
and the presence of SNPs, we classified them into
three main groups with subgroups: "regular" isolates with
no insertions or deletions (with different numbers of
SNPs), isolates with "long insertions" and isolates with
"many individual" insertions and deletions (with different
positions of SNPs). This classification is close to the
results of phylogenetic analysis.


Results and discussion 

Genome polymorphism 

All the sequences are between 29573 and 29757 nucleotides in
length (Table 1), with a high degree of similarity (>99%
pairwise). Still, they can be differentiated on the basis of
sequence polymorphism (insertions and deletions) and the
number and sites of SNPs [8]. Results of the comparison
of genome primary structure of the analyzed isolates are
given in Figure 1.

Analysis of genomic polymorphism of the isolates
yielded the following observations:

I) Some of the isolates are nucleotide-identical or almost
identical. There are two pairs of nucleotide-identical isolate
sequences: (TWH, TWC2) and Tor2 (with accession
numbers AY274119 and NC_004718). Therefore, instead of
38, we consider the dataset to contain 36 isolates. Further,
the isolate TWC3 differs from TWH in just one position
(see table in additional file 1), which is about what would
be expected at random [11]. Isolates Frankfurt 1 and FRA are identical
up to the poly-"a" of length 13 present at the 3' end of FRA
(Figure 1).

II) Similarity analysis showed that a significant number of
isolates have the same length (29727 bases) and the same
beginning and ending subsequences (which seem to be the
exact starts and ends of the complete SARS-CoV genome
up to the poly-"a" at the 3' end), thus forming a kind of
referent group; these are the isolates TWH, TWC3, TWK,
TWS, TWY, Urbani and Frankfurt 1 (Figure 1). The fully
sequenced isolate TWH has then been chosen as the referent
isolate for sequence comparisons, since its average
number of SNPs compared to other isolates is the smallest.
For example, TWH and Urbani have on average 15.7 and
17.6 SNPs, respectively, against all the isolates, and 5.7
and 10.5, respectively, against the referent group.
For SNPs see the tables in additional files 1 and 2.
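The choice of a referent isolate by smallest average SNP count can be illustrated with a minimal Python sketch. It assumes the sequences have already been brought to a common length (structurally identical parts only); the function names and the toy sequences are hypothetical and are not taken from the study.

```python
from itertools import combinations


def count_snps(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def average_snps(seqs: dict[str, str]) -> dict[str, float]:
    """Average SNP count of each isolate against all the other isolates."""
    totals = {name: 0 for name in seqs}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        d = count_snps(s1, s2)
        totals[n1] += d
        totals[n2] += d
    others = len(seqs) - 1
    return {name: total / others for name, total in totals.items()}


# Toy sequences (hypothetical, not real SARS-CoV data).
toy = {"iso1": "acgtacgt", "iso2": "acgtacga", "iso3": "ccgtacga"}
averages = average_snps(toy)
referent = min(averages, key=averages.get)  # isolate with the smallest average
```

In this toy set `iso2` plays the role that TWH plays in the dataset: it differs least, on average, from the other sequences.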

III) Most isolates, compared to TWH, are shorter at the
5' end (e.g., Sin2500, Sin2679, Sin2774, Sin2677,
Sin2748, AS), have poly-"a" strings of various lengths at the
3' end (e.g., Tor2, HSR1, FRA, BJ02, TW1, HKU-39849,
WHU), or both (BJ01, BJ03, BJ04, CUHK-W1, CUHK-Su10).
Three of the isolates, Taiwan TC1, Taiwan TC2 and Taiwan
TC3, have both starting and ending deletions (69
nucleotides at the 5' end and 85 at the 3' end). Several isolates
(e.g., TWJ, TWC, Sin2677, Sin2748) have some short deletions
inside the sequence (Figure 1).
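The two end effects described in (III) can be measured with a short Python sketch. This is only an illustration under stated assumptions: the helper names and the `probe_len` parameter are hypothetical, the 20 leading bases of `ref` are the 5' start shown in Figure 1, and the middle of the toy sequences is invented.

```python
def poly_a_length(seq: str) -> int:
    """Length of the run of 'a' at the 3' end of a sequence."""
    s = seq.lower()
    return len(s) - len(s.rstrip("a"))


def five_prime_offset(ref: str, seq: str, probe_len: int = 10) -> int:
    """Number of leading referent bases missing from seq.

    Returns -1 if the first probe_len bases of seq are not found in ref.
    """
    return ref.lower().find(seq.lower()[:probe_len])


# Toy referent: 5' start as in Figure 1, an invented middle, and the "tgac" end.
ref = "atattaggtt" + "tttacctacc" + "gggg" + "tgac"
# Toy isolate: starts at "ctacc" and carries a poly-"a" tail of length 3.
seq = "ctacc" + "gggg" + "tgac" + "aaa"

missing_at_5_prime = five_prime_offset(ref, seq)
tail = poly_a_length(seq)
```

The tail measurement relies on the genome proper ending in a base other than "a" (here "tgac"); otherwise the run would be overestimated.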

IV) There is a group of isolates that have insertions of significant
length (29 nucleotides) inside the sequence. These are
the isolates GD01, SZ3 and SZ16. A significant number of
individual insertions have been identified in the ZJ01 and
ZMY 1 isolates (Figure 1, additional files 3, 4, 5).

As for the SNP contents of isolates, there is a significant
difference in the number of SNPs for different pairs of isolates.
For TWH as the referent isolate, this number varies
from 1 to 80 SNPs. Isolates may be classified into three
groups based on the number of SNPs with TWH (Figure 2):

1. with fewer than 15 SNPs (TWC3, TWK, TWS, TWY, Urbani,
TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500,
Sin2679, Sin2774, Sin2677, Sin2748, Taiwan TC1, Taiwan
TC2, Taiwan TC3, Frankfurt 1, FRA, HKU-39849,
CUHK-W1),

2. with between 15 and 30 SNPs (WHU, GZ50, BJ01-BJ04, ZJ01),

3. with 30 or more SNPs (GD01, SZ3,
SZ16, ZMY 1).
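The three SNP-count classes above amount to a simple threshold rule, sketched here in Python. The function name is hypothetical; the counts for TWC3 (1) and ZMY 1 (80) are the ones quoted in the text, while the third entry is an invented placeholder. The sketch assumes that "between 15 and 30" means 15 to 29, since 30 belongs to the third class.

```python
def snp_class(n_snps: int) -> int:
    """Class 1: fewer than 15 SNPs; class 2: 15 to 29; class 3: 30 or more."""
    if n_snps < 15:
        return 1
    if n_snps < 30:
        return 2
    return 3


# SNP counts against TWH: TWC3 and ZMY 1 as quoted in the text,
# "placeholder" is a made-up value for illustration only.
counts = {"TWC3": 1, "ZMY 1": 80, "placeholder": 20}
classes = {name: snp_class(n) for name, n in counts.items()}
```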

Finally, besides the number, there are differences in the
positions of SNPs (potential mutation sites). In order to exclude
nucleotide changes that probably arose during propagation
of the virus in cell culture and sequencing, Figure 3
shows the positions (on the relative scale of all isolates
and on the TWH scale) where two or more SNPs occurred, not
taking into consideration the isolates with long insertions
(GD01, SZ3 and SZ16). The positions of multiple SNPs of
these three isolates, which are similar among the three,
are highly different from all the others and are
shown in Figure 4. These results coincide with those
published in Marra et al.'s paper [4] for the Urbani and Tor2
isolates, but differ from those published in Ruan's paper
[8] for the 14 isolates analyzed therein (Sin-group, BJ-group,
Tor2, Urbani, CUHK-W1, HKU-39849, GD01),
which were obviously based on different revisions of the
PubMed NCBI Entrez database [9]: the lengths of the
sequences Tor2, CUHK-W1, GD01 and BJ01-BJ04 differ from
the revisions we analyzed, and consequently so do some
nucleotides and the number of base changes at given positions.
Differences include the following positions (based
upon Urbani and TWH SARS-CoV): 2601 (Tor2 T instead
of C, BJ04 T instead of missing base), 7919 (BJ03 C
instead of T), 8559 (BJ04 T instead of A), 8572 (BJ01 T
instead of G, GD01 G instead of T), 9404 (BJ04 T instead
of missing base), 9479 (BJ04 T instead of missing base),
9854 (BJ04 T instead of missing base), 19838 (GD01 G
instead of A), 21721 (GD01, BJ01 A instead of missing
base, BJ04 G instead of missing base), 22222 (BJ04 C
instead of N), 27243 (GD01 T instead of C, BJ03 T instead
of N), 29279 (all A's). The results obtained also differ
from Hsueh et al. [12] regarding nucleotides in the HKU-39849
isolate at positions 7746, 9404, 9479, 17564,
17846, 19064, 21721, 22222, 27827.
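The filter for "multiple SNPs" (positions where two or more isolates differ from the referent) can be sketched as follows. This is a minimal illustration, assuming all sequences are already on the referent scale with equal length; the function name, the `min_isolates` parameter and the toy sequences are hypothetical.

```python
from collections import defaultdict


def multiple_snp_positions(ref: str, isolates: dict[str, str],
                           min_isolates: int = 2) -> dict[int, list[str]]:
    """Map 1-based referent positions to the isolates that differ there,
    keeping only positions hit by at least min_isolates isolates."""
    carriers = defaultdict(list)
    for name, seq in isolates.items():
        for pos, (r, b) in enumerate(zip(ref, seq), start=1):
            if r != b:
                carriers[pos].append(name)
    return {pos: names for pos, names in sorted(carriers.items())
            if len(names) >= min_isolates}


# Toy data (hypothetical): position 2 changes in two isolates,
# position 10 in only one, so only position 2 is reported.
ref = "acgtacgtac"
isolates = {"isoA": "atgtacgtac", "isoB": "atgtacgtaa", "isoC": "acgtacgtac"}
multiple = multiple_snp_positions(ref, isolates)
```

A change seen in a single isolate is dropped here, which mirrors the reasoning that isolated changes may be propagation or sequencing noise.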

Additional files 1, 2, 3, 4 and 5 present the SNPs for all the isolates
in all five groups, whether they occur in ORFs or IGRs (for
annotated isolates), as well as the number of SNPs in
ORFs and in IGRs per isolate. The total number of
SNPs is 312 (only 2 in IGRs: TWH positions 27812 for the
isolate Taiwan TC3 and 27827 for the isolates BJ01 and
CUHK-W1). The average number of SNPs per isolate is
15.7; TWC3 (just 1 SNP) and ZMY 1 (as many as 80) deviate
significantly from this average.

Grouping of isolates 

The isolates from the dataset considered may be classified
according to the sequence polymorphism and SNP content
properties just described. At first, properties (III, IV)
may result in three different groups (Figure 2):

A. "regular isolates" whose nucleotide structure is close to
the referent group (different 5' and 3' ends, short deletion,
individual insertion): TWH, TWC3, TWK, TWS, TWY,
Urbani, TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS,
Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, Taiwan
TC1, Taiwan TC2, Taiwan TC3, WHU, Frankfurt 1, FRA,
HKU, CUHK-W1, GZ50 and BJ01-BJ04 (Figure 5, 6a)

B. isolates with "long insertions": GD01, SZ3 and SZ16
(Figure 6b) and

C. isolates with "many individual" insertions: ZJ01 and
ZMY 1 (Figure 7a, 7b).


Figure 2

Structural tree for SARS-CoV isolates. The tree is based on qualitative analysis of sequence variation of 36 isolates.


Further, SNP properties (1-3) may divide the A group into
A1 and A2 subgroups, and the C group into C1 and C2 subgroups:

A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC,
TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679,
Sin2774, Sin2677, Sin2748, Taiwan TC1, Taiwan TC2,
Taiwan TC3, Frankfurt 1, FRA, HKU and CUHK-W1
(Figure 5)

A2. WHU, BJ01-BJ04 and GZ50 (Figure 6a)

C1. ZJ01 (Figure 7a)

C2. ZMY 1 (Figure 7b)


Figure 3

Positions with two or more SNPs in A and C groups with amino acid changes. Positions are represented on the relative
scale of all the isolates and on the TWH scale. Isolates from group B have not been counted, since their positions of SNPs,
while coordinated among them, are highly different from all the others. SNPs are in bold type. Proteins associated with SNPs
are represented based on TWH annotation. IDs of annotated isolates are in grey boxes. Positions of SNPs causing amino acid
changes, together with the amino acids and the changes in their properties [16], are in grey. Legend of amino acid properties: Hp: hydrophobic,
Ar: aromatic, Ap: aliphatic, P: polar, NCh: negative charged, PCh: positive charged, S: small, T: tiny.

Figure 4

Positions with two or more SNPs in B group with amino acid changes. Only SNPs in B group isolates, regarding
TWH, have been counted. The same notation is applied as in Figure 3.

Figure 5

Density distribution of SNPs, insertions and deletions in the isolates of A1 group. SNPs are represented above the
line, insertions below the line, upward oriented, and deletions below the line, downward oriented. The TWH scale is used.
The same holds for Figures 6, 7, 8.

Finally, the positions of SNPs move CUHK-W1 from the
A1 into the A2 group (more than 50% of common SNP positions),
while WHU moves from A2 into A1 (less than
30% of common SNP positions), giving the final grouping
of isolates presented as a structural tree (Figure 2):

A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC,
TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679,
Sin2774, Sin2677, Sin2748, Taiwan TC1, TC2, TC3,
Frankfurt 1, FRA, HKU and WHU (Figure 5 and
additional file 1)

A2. CUHK-W1, GZ50 and BJ01-BJ04 (Figure 6a and
additional file 2)

B. GD01, SZ3 and SZ16 (Figure 6b and additional file 3)

C1. ZJ01 (Figure 7a and additional file 4)

C2. ZMY 1 (Figure 7b and additional file 5).

Although qualitative in nature, the structural tree turns 
out to be close to the quantitative grouping which is a 
basis for (computational) phylogenetic classification. 

Tables in additional files 1, 2, 3, 4 and 5 present SNPs, insertions
and deletions in groups A-C (see additional file 1
for isolates of the A1 group, on the relative and TWH scales,
additional file 2 for isolates of the A2 group and TWH, on the
TWH scale, additional file 3 for isolates of group B with
TWH, and additional files 4 and 5 for isolates of the C1 and C2
groups, respectively). Figures 5, 6 and 7 show the density
distribution of SNPs, insertions and deletions on the TWH
scale for the same groups of isolates. Figure 8 shows
the overall density distribution of SNPs, insertions and
deletions for all 36 isolates, along with the gene map
for TWH (which is quite similar to the gene maps of other
isolates). Given the number of available sequences, the density
distributions do not yet show regularities that
could provide for precise statistical characterization. Still,
they exhibit crowding regions close to the 3' end, which is
also characterized by the presence of a number of proteins
of unknown function.

Figure 6

(a and b). Density distribution of SNPs, insertions and deletions in the isolates of A2 and B groups. In A2 group
there are no insertions / deletions. In B group there are large insertions in GD01, SZ3 and SZ16 isolates.
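A density distribution of the kind described in this section can be approximated by counting events in fixed-width windows along the referent scale. The following Python sketch is an assumption-laden illustration: the function name and the 1000-base window are hypothetical (the bin width used for the figures is not stated), and the input positions are simply the twelve TWH-scale positions quoted earlier in the text, used here as sample input rather than as the data behind any figure.

```python
from collections import Counter


def density(positions: list[int], genome_length: int,
            window: int = 1000) -> list[int]:
    """Count events per window of `window` bases; positions are 1-based."""
    counts = Counter((p - 1) // window for p in positions)
    n_bins = (genome_length + window - 1) // window  # ceiling division
    return [counts.get(i, 0) for i in range(n_bins)]


# Sample positions on the TWH scale (taken from the list of positions
# quoted in the text); 29727 is the TWH length from Table 1.
positions = [2601, 7919, 8559, 8572, 9404, 9479, 9854,
             19838, 21721, 22222, 27243, 29279]
hist = density(positions, genome_length=29727)
```

Separate histograms for SNPs, insertions and deletions would be obtained by calling the function once per event type.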

It can also be noted that the proposed grouping of 36 isolates,
although based on different criteria, still conserves the previous
classification T-T-T-T / C-G-C-C [8]. All the isolates
from groups A1 and C have the T-T-T-T configuration, while
all the isolates from groups A2 and B have the C-G-C-C
configuration, except for GZ50 and BJ04, which are T-G-C-C (Figure 2,
Figure 9). The two sequence variants correspond to the
epidemiological spread. Those that originated in
the Hotel M in Hong Kong have the T-T-T-T configuration
(groups A1 and C in our classification): Canada
(Tor2), Singapore (all Sins), Frankfurt, Taiwan, Hong
Kong (HKU-39849), Hanoi (Urbani), Italy (HSR1), China
(ZJ01), etc. The others have the C-G-C-C configuration (A2 and
B in our classification) and originated in Guangdong,
China (GD01 and GZ50), Hong Kong (CUHK-W1, SZ3
and SZ16) and Beijing (BJ01-BJ04). The fact that the enlarged
number of isolates exhibits the same properties at
the four loci supports the assumption that the mutations
could not have arisen by chance base substitution during
propagation in cell culture and the sequencing procedure
[8].
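Reading off a four-locus configuration such as T-T-T-T or C-G-C-C is a matter of sampling four positions of each sequence. The Python sketch below illustrates this under explicit assumptions: the loci positions `(2, 5, 7, 9)` and the toy sequences are hypothetical stand-ins, since the genomic positions of the four loci are not listed in this section.

```python
def configuration(seq: str, loci: tuple[int, ...]) -> str:
    """Bases at the given 1-based positions, upper-cased and joined by hyphens."""
    return "-".join(seq[p - 1].upper() for p in loci)


# Hypothetical loci and toy sequences, for illustration only.
LOCI = (2, 5, 7, 9)
toy = {
    "variant1": "attatctct",
    "variant2": "acgagccac",
    "variant3": "atgagccac",
}
configs = {name: configuration(seq, LOCI) for name, seq in toy.items()}
```

Grouping isolates by the resulting string then reproduces the two main variants plus the intermediate T-G-C-C pattern.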

Changes in amino acids 

We analyzed amino acid changes in proteins for the annotated
isolates (19 out of 36), and in presumed proteins for the
non-annotated ones, for multiple SNPs in all the isolates.
Results of the analysis are shown in Figures 3 and 4.
Figure 3 shows that silent mutations occurred in envelope
protein E, while nucleotide changes resulted in amino
acid changes in spike (S), membrane (M) and
nucleocapsid (N) proteins. All three SNPs in the spike
protein are situated in the outer membrane region and not
within the potential epitope region (amino acid positions
469-882) proposed by Ren Y. et al. [13]. Amino acid
changes occurred in two multiple SNPs in M protein, one
multiple SNP in N protein and 7 (out of 13) multiple
SNPs of the polyprotein 1ab, as well as in one multiple
SNP of a hypothetical protein, while silent mutations
occurred in three hypothetical proteins. Figure 3 also shows
the properties of the corresponding amino acids
resulting from the SNPs. The only significant changes in amino
acid properties are Gly->Asp in S protein (A2 and B groups, i.e.,
in the CUHK-W1, GZ50, BJ01-BJ03, GD01, SZ3 and SZ16
isolates) and Cys->Arg in a hypothetical protein (the same
isolates, and BJ04 in addition). The only additions in
non-annotated sequences are in the hypothetical protein following
S protein in TWH, exhibiting a silent change, and in
non-annotated BJ02 and BJ03, corresponding to the hypothetical
protein, Gly->Glu. A similar analysis can be done for
amino acid changes corresponding to SNPs at positions
specific to B group isolates (Figure 4). Taking into
account the only annotated isolate, GD01, there are five
amino acid changes in polyprotein 1ab, two changes of amino acid
properties in S protein (Ser->Leu and Tyr->Asp,
the second being within the epitope region), one amino
acid change in M protein and one change of amino acid
properties (Cys->Arg) in BGI-PUP.

Figure 7

(a and b). Density distribution of SNPs, insertions and deletions in the isolates of C1 and C2 groups.

Figure 8

The overall density distribution of SNPs, insertions and deletions along with the gene map for TWH.

Figure 9

Phylogenetic tree of 36 SARS-CoV complete genome isolates. Distances represent degree of sequence variation. The
largest distance is associated with ZMY 1, followed by the ZJ01 isolate (groups C1, C2). Groups A1, A2 and B are clearly
distinguished. The tree has been obtained using CLUSTALW and PhyloDraw programs.
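The distinction between silent and amino-acid-changing SNPs discussed in this section can be checked codon by codon with the standard genetic code. The Python sketch below is illustrative only: the function name is hypothetical, and the example codons are invented ones that happen to produce Gly->Asp and Cys->Arg; the actual codons affected in the isolates are not given in the text. Bases are written with "t", as in the database sequences.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snp_effect(codon: str, offset: int, new_base: str) -> tuple[str, str, bool]:
    """Return (old amino acid, new amino acid, is_silent) for one base change
    at 0-based `offset` within `codon`."""
    codon = codon.lower()
    mutated = codon[:offset] + new_base.lower() + codon[offset + 1:]
    old, new = CODON_TABLE[codon], CODON_TABLE[mutated]
    return old, new, old == new


# Hypothetical codons chosen to illustrate the kinds of change reported.
gly_to_asp = snp_effect("ggt", 1, "a")   # second codon position
cys_to_arg = snp_effect("tgt", 0, "c")   # first codon position
silent = snp_effect("ggt", 2, "c")       # third codon position
```

Third-position changes are often silent, as in the last example, whereas first- and second-position changes usually alter the amino acid.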

Phylogenetic analysis 

The SARS-CoV isolates have been multialigned using the
CLUSTALW program [10] as the very first step in
obtaining a phylogenetic tree. The aligned sequences have
then been submitted to CLUSTALW for bootstrapping
and phylogenetic tree production. Enlargement of the
sequence set resulted in a refinement of the phylogenetic
tree produced, as compared to previous results such
as Ruan [8] and Zhang & Zheng [14], obtained for 14 and
16 isolates, respectively. The phylogenetic tree obtained,
drawn using the PhyloDraw program [15], is shown
in Figure 9. It is similar to our structural tree based on
qualitative analysis of the isolates (Figure 2).

The results of the analysis of dissimilarities, described in 
previous paragraphs, are in accordance with the 
alignment obtained by CLUSTALW, but regrouped and 
formatted in a way that facilitates further interpretation 
and application. 
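To give a feel for the kind of pairwise dissimilarity a tree reflects, the following Python sketch computes a simple p-distance (fraction of differing sites) from already aligned sequences. It is a stand-in for illustration and not the distance or tree-building procedure of CLUSTALW; the function names and the toy alignment are hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites among aligned columns with no gap ('-')
    in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Pairwise p-distances for every ordered pair of aligned sequences."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m in names for n in names}


# Toy alignment (hypothetical); '-' marks a gap.
aligned = {"s1": "acgtacgtac", "s2": "acgtacgtaa", "s3": "ac-tacgtaa"}
d = distance_matrix(aligned)
```

With more than 99% pairwise similarity, as reported for these isolates, all such distances would fall below 0.01.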

Conclusion 

Comparative analysis of genome sequence variations of 
38 SARS-CoV isolates resulted in some conclusions that 
might be of interest in further investigation of the SARS- 
CoV genome: 

1. All of the SARS-CoV isolates are highly homologous 
(more than 99% pairwise). Most of them have similar 
nucleotide structure, with the same 5' and 3' ends and 
poly-"a" at the 3' end of different length (0-24), some of 
them with a single short deletion close to the 3' end of the 
sequence; out of 312 SNPs in total, only two are in IGRs. 

2. Three of the 38 isolates have long insertions within the 
sequence; 


3. Two of the isolates have a large number of individual 
insertions / deletions, exhibiting different SNP positions; 

4. All the isolates may be grouped according to sequence
polymorphism into three groups (with up to two subgroups),
reflecting their similarities / dissimilarities. Since
the isolate sequences have a high degree of homology,
different properties of groups are represented in a more
transparent way in the classification tree obtained by such
a qualitative analysis than in a bootstrapped phylogenetic
tree obtained from multialigned sequences using the
CLUSTALW program [10].

5. The total number of amino acid changes caused by multiple
SNPs is 15 (in isolates of A, C groups) and 34 in
isolates of B group. The total number of silent mutations
is 10 (for A, C groups) and 7 (for B group).

6. Since S protein is of special interest regarding its receptor
affinity and antigenicity, it is interesting to notice that
all amino acid properties' changes are located in its outer
membrane region, one for A, C groups and two for B
group.

7. The results obtained may be useful in further investigation
aiming at identification of SARS-CoV genome regions
responsible for its infectious nature.

Methods 

Dataset 

We investigated the complete genomes of 38 SARS-CoV
isolates. Nucleotide sequences were taken from the PubMed
NCBI Entrez database [9] in gbk and fasta formats (Table 1).

The coverage included all the isolates published by October
31st, 2003 (with updated revisions). The identifiers,
accession numbers, genomic size (in nucleotides), revision
dates and country or source of the isolates considered
are included in the table, together with the labels used
in this paper. The fully sequenced isolate TWH has been
chosen as the referent isolate, since its average number of
SNPs was the lowest as compared to all other isolates.

Methods for similarity analysis 

For similarity analysis of isolates, the following two-step
procedure has been applied:

1. identification of structurally identical parts of isolates,
i.e., insertion and deletion sites

2. identification of SNPs in structurally identical parts.

Step 1 has been carried out by a function performing similarity
analysis of subsequences of a given length (e.g., 100
bps), and identifying significantly non-matching strings
as being inserted in the corresponding sequence (i.e.,
deleted from the other). Since a significant number of isolates
have the same length (29727 bases) and the same starting and
ending subsequences (that seem to be the exact starts and
ends of the complete SARS-CoV genome up to the
poly-"a" at the 3' end), they may be considered as forming a
representative group. The nucleotide structure of all other
isolates was analyzed with respect to this representative
group. For each pair of isolates (x, y) (x from the representative
group), a file InsDelx-y has been produced containing
the positions and lengths of each of the insertions or
deletions in the isolate y.
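The outcome of step 1 (positions and lengths of insertions and deletions in an isolate relative to a representative sequence) can be illustrated with a short Python sketch. Note the substitution: instead of the window-based function described above, whose details are not given, the sketch uses the general-purpose `difflib.SequenceMatcher` from the Python standard library. The function name and the toy sequences are hypothetical, and this approach is intended for small examples rather than full 30 kb genomes.

```python
from difflib import SequenceMatcher


def find_indels(ref: str, seq: str) -> list[tuple[str, int, int]]:
    """List (kind, 0-based referent offset, length) for each insertion or
    deletion in seq relative to ref. Equal-length substitutions are ignored;
    they are left for the SNP step."""
    matcher = SequenceMatcher(None, ref, seq, autojunk=False)
    events = []
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, seq_len = i2 - i1, j2 - j1
        if seq_len > ref_len:
            events.append(("insertion", i1, seq_len - ref_len))
        elif ref_len > seq_len:
            events.append(("deletion", i1, ref_len - seq_len))
    return events


# Toy sequences (hypothetical, not real SARS-CoV data).
ref = "gattacagcctaggtcaacg"
with_insertion = ref[:8] + "ttttt" + ref[8:]   # 5 bases inserted after offset 8
with_deletion = ref[:5] + ref[9:]              # 4 bases deleted at offset 5

insertions = find_indels(ref, with_insertion)
deletions = find_indels(ref, with_deletion)
```

The returned list plays the role of the per-pair insertion and deletion file described above: everything between consecutive events is a structurally identical part.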

Step 2 has been carried out by comparing structurally
identical parts (of the same length) of pairs of isolates.
The starting and ending positions of those parts have been
taken from the file InsDelx-y (for comparison of x and y)
produced in step 1. The procedure returns its results in a file
listing the SNPs between the two sequences (files Mismx-y).
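Step 2 can be sketched in Python as a position-by-position comparison restricted to the structurally identical parts. The representation of those parts as `(ref_start, seq_start, length)` triples is an assumption made for this illustration, as are the function name and the toy sequences; the actual file layout used in the study is not described.

```python
def snps_in_segments(ref: str, seq: str,
                     segments: list[tuple[int, int, int]]) -> list[tuple[int, str, str]]:
    """Compare structurally identical parts of two sequences.

    segments holds (ref_start, seq_start, length) with 0-based starts.
    Returns (1-based referent position, referent base, isolate base) per SNP.
    """
    found = []
    for ref_start, seq_start, length in segments:
        for k in range(length):
            r, b = ref[ref_start + k], seq[seq_start + k]
            if r != b:
                found.append((ref_start + k + 1, r, b))
    return found


# Toy data (hypothetical): the isolate carries a 5-base insertion after
# referent offset 8, plus one SNP on each side of it.
ref = "gattacagcctaggtcaacg"
seq = "gattgcag" + "ttttt" + "cctaggtcaatg"
segments = [(0, 0, 8), (8, 13, 12)]   # parts on either side of the insertion
snps = snps_in_segments(ref, seq, segments)
```

Reporting positions on the referent scale keeps SNPs from different isolates comparable, which is what the TWH scale provides in the figures and additional files.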

We also used the CLUSTALW program [10] for multialignment
as a control process, as well as for phylogenetic
investigations.

Methods for phylogenetic investigation 

In order to use similarity analysis results for drawing any 
phylogenetic conclusions about the SARS-CoV genome 
dataset, a CLUSTALW [10] multialigned output has been 
generated and a bootstrapped phylogenetic tree has been 
produced and drawn using the PhyloDraw program [15]. 
TWH scales. IDs of annotated isolates are in grey boxes; SNPs in ORFs
2105-5-65-S2.xls]

Additional File 3 

Positions of SNPs and insertions in B group. The exact positions on all four 
scales (TWH, GD01, SZ3 and SZ16) are given. ID of the only annotated 
isolate (GD01) is in grey box; SNPs in ORFs (or corresponding to those 
in ORFs, for non-annotated isolates) are in red bold. The total number of 
Additional File 5